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(54) Identifying web-log data representing a single user session 



(57) Tracking the actions of an Internet user in- 
volves loading data from the transaction log of an Inter- 
net server into a database system. The data includes 
an entry for each request to the Internet server, including 
information identifying which user submitted the request 
and information identifying the time at which the request 
was received. The database system recreates the ac- 
tions, or ciickstream, of a particular user by selecting all 
entries associated with that user and corresponding to 
a single user session. 
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Description 

[0001] Companies that do business on the Internet are beginning to realize that they could improve sales and cus- 
tomer service by tracking the actions of individual customers who visit the companies' Web sites. To this end, many 
5 companies have begun using the data collected by Web servers in trying to reconstruct the ■dickstreams 0 of individual 
customers visiting those Web sites. The challenge, however, lies in making sense of the vast amount of data collected 
by Web servers during the course of even a single day. 

[0002] In general, a Web server records a 8 hiT in its Web log each time a visitor requests a piece of data from the 
server. Studies suggest that each request for a Web page produces, on average, five hits to the web server - one hit 

w for HTML text and four hits for other objects, such as images and audio clips, associated with the Web page. Given 
that individual users often request several Web pages per minute and that Web sites typically host scores of concurrent 
users, even a moderately busy Web site often experiences millions, sometimes billions, of hits each day. Reconstructing 
even a single page view for a single customer requires combing through hundreds, even thousands, of pages of Web- 
log data. Reconstructing the entire dickstream for a particular customer is a daunting task indeed. 

is [0003] Tracking the actions of an Internet user involves loading data from the transaction log of an Internet server 
into a database system. The data includes an entry for each request to the Internet server, including information iden- 
tifying which user submitted the request and information identifying the time at which the request was received. The 
database system recreates the actions, or clickstream, of a particular user by selecting all entries associated with that 
user and corresponding to a single user session. 

20 [0004] Other features and advantages will become apparent from the description and daims that follow. 

[0005] According to a first aspect of the present invention there is provided a method for use in tracking the actions 
of an Internet user, the method comprising: loading data from a transaction log of an Internet server into a database 
system, where the data indudes an entry for each request to the Internet server, induding information identifying which 
user submitted the request and information identifying the time at which the request was received; and selecting from 

25 the data all entries assodated with a particular user and corresponding to a single session of that user. 

[0008] According to a second aspect of the present invention there is provided a computer program, stored on a 
tangible storage medium, for use in tracking the actions of an Internet user, the program comprising executable in- 
structions that cause a computer to: load data from a transaction log of an Internet server into a database system, 
where the data includes an entry for each request to the Internet server, including information identifying which user 

30 submitted the request and information identifying the time at which the request was received; and select from the data 
all entries associated with a particular user and corresponding to a single session of that user. 
[0007] According to a third aspect of the present invention there is provided a database system comprising: one or 
more data-storage fadlities for use in storing data received from a transaction log of an Internet server computer, where 
the data includes an entry for each request to the Internet server computer, induding information identifying which user 

35 submitted the request and information identifying the time at which the request was received; and one or more process- 
ing modules configured to manage the data stored in the data-storage fadlities; and a database-management com- 
ponent configured to select from the data all entries associated with a particular user and corresponding to a single 
session of that user. 

[0008] An embodiment of the present invention will now be described, by way of example, with reference to the 
40 accompanying drawings, in which: 

FIGS. 1 and 2 are schematic diagrams of a system for use in capturing and analyzing web-log data from Internet 
servers; and 

FIG. 3 is a flow chart of a technique for use in reconstructing the dickstreams of visitors to an Internet site in 
45 accordance with the present invention. 

[0009] FIG. 1 shows a system for use in capturing and analyzing the data stored in the Web log of a typical Internet 
server. In general, one or more customers of an Internet-based business, using one or more dient computing systems 
105, 110, visit the business 1 Web servers 115, 120 through the Internet 125. The Web servers 115, 120 catalog every 
so piece of information requested by the dient systems 1 05, 1 1 0 in Web logs 1 30, 1 35. Table I below shows the types of 
entries found in a typical Web log. 
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[04/03/00 15:58t3B:4 user l«ip. address. l{81ce9636}Thread-56 1 9548081073871 system: 
Executing TestMain 

104/03/00 15:58:38:7 userMip. address. 2 { 8b9a63ad) Thread-46 1 954808118796) system: 
Executing OLAHasterPage2 

(04/03/00 15:58:38:8 user2dip . address . 2 {8b9a63ad} Thread- 46 1 954808118796] system: 
Executing 0LAMasterPage2 

[04/03/00 15:58:40:3 user3»ip. address. 3 (004a6ebe}Thread-46 1 954808120281] system: 
Executing Test2Maln 

[04/03/00 15:59:00:3 user4©ip. address. 4 {05cl3dBe}Thread-40| 954608140357] system: 
Executing Test3 

[04/03/00 15:59:06:5 userMip. address. 5 {d9e81Cl8}Thread-2B 1 9548 0814 628 9} system: 
Executing Test3 

[04/03/00 15:59:09:9 user69ip. address. 6 {4a29b2ea} Thread- 15 1 95480814 9945] system: 
Executing Test3 

(04/03/00 15:59:56:9 user7dip. address. 7 {ad23a2fd} Thread- 32 (954808166955] system: 
Executing Home 

TABLE 1 



[0010] Web-log entries usually include several pieces of information, such as a date-and-time stamp for each request 
20 submitted to the Web server, a code identifying the user or client system making the request, and the name of the 
action or information requested. In the example shown here, the first Web log entry includes the date-and-time stamp 
"04/03/00 15 58:38:4," the user-ID code "user@ip.address.1 ," and the action code °system: Execute TestMain." 
[0011] The Web servers 115, 120 maintained by the business both connect to a database management system 
(DBMS) 1 50 such as a Teradata Active Data Warehousing System available from NCR Corporation. The DBMS 1 50 
25 gathers data' from the Web logs 130, 140 maintained by the Web servers 115, 120 and uses this data to reconstruct 
the clickstreams associated with individual user sessions. 

[0012] FIG 2 shows a sample architecture for the DBMS 150. The DBMS 150 includes one or more processing 
modules205 1 m that manage the storage and retrieval of data in data-storage facilities 21 0, ^. Each of the processing 
modules 205, N manages a portion of a database that is stored in a corresponding one of the data-storage facilities 

30 210, N . Each of the data-storage facilities 210, N includes one or more disk drives. 

[001 3] As described below, the system stores Web-log data in one or more tables in the data-storage facilities 21 0, M . 
The rows 215, z of the tables are stored across multiple data-storage facilities 210,.„ N to ensure that the system 
workload is distributed evenly across the processing modules 205,... N . A parsing engine 220 organizes the storage of 
data and the distribution of table rows 215, z among the processing modules 205, . N . The parsing engine 220 also 

35 coordinates the retrieval of data from the data-storage facilities 21 0 1>N in response to queries received from a user at 
a mainframe 230 or a client computer 235. The DBMS 150 usually receives queries in a standard format, such as the 
Structured Query Language (SQL) put forth by the American National Standards Institute (ANSI). 
[0014] Onechallenge in reconstructing theclickstream associated with an individual customer is identifying the points 
at which the user's session began and ended or, more importantly, identifying which Web-log entries are associated 

40 with a single browser session. Because browser sessions typically end after some selected amount of inactivity 0.a f 
30 minutes) the DBMS can treat any two Web-log entries that occur within this time range and that onginate from a 
single user as though they occurred within a single user session. A DBMS function that compares the values of two 
date-and-time-stamps is useful in identifying Web-log entries that occurred within a single user session and thus that 
lie within a clickstream. The "Moving Difference" (MDIFF) extension to SQL recognized by the Teradata DBMS is one 

45 such DBMS function. t . ^ *.™.-r- r^.o 

[0015] FIG. 3 shows one technique for conducting clickstream analysis of Web-log data using the MDIFF DBMS 
function. The DBMS first loads the Web-log data from the Web servers into a single-column table (step 300). Below is 
sample SQL code for use in loading the Web-log data into the database. 

50 
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Database sessionize; 
DROP TABLE input? 
DROP TABLE input_Error_l ; 
DROP TABLE input Error_2 ; 

CREATE SET TABLE input , NO FALLBACK , 
NO BEFORE JOURNAL, 
NO AFTER JOURNAL 
( 

weblog_txt CHAR (1000)) 
PRIMARY INDEX (weblogtxt} ; 

BEGIN LOADING input 

ERRORFILES input Error 1 , 
input_Error_2 ; 
SET RECORD YARTEXT ■ | ■ ; 

DEFINE 

weblog txt (VARCHAR(IOOO) ) 

FILE » testweblog.txt; 

INSERT INTO input VALUES ( :weblog_txt) ; 

END LOADING; 
.LOGOFF 

[0016] The DBMS then parses the data to identify. the pieces of information to be extracted from each Web-log entry 
(step 305) and places this information in a table having one column for each of these pieces of information (step 31 0). 
For example, in the example above, the DBMS creates a table having three columns - one to store date-and-time 
stamps, one to store user-ID codes, and one to store the Web-log text describing the action or information requested. 
The sample SQL code below is useful in parsing the Web-log data into a three-column table. 



CREATE SET TABLE presession, NO FALLBACK, 
NO BEFORE JOURNAL, 
NO AFTER JOURNAL 
( 

user_id CHAR (50) CHARACTER SET LATIN NOT CASE SPECIFIC , 
transaction_timestamp INTEGER, 

weblog_txt CHAR (500) CHARACTER SET LATIN NOT CASESPECIFIC) 
PRIMARY INDEX { user_id , transact ion_timestainp ) ; 

INSERT INTO presession 

SELECT ( SUBSTR (weblog_txt , 21, { INDEX (weblog_txt , ■ { ' ) -21) ) ) 

, ( SUBSTR (weblog_txt ,2,9) (DATE , FORMAT 1 MM/DD/YY • ) (INTEGER) ) ♦ 

(SUBSTR(weblog_tXt,ll f 8) (FLOAT, FORMAT '99:99:99') (INTE< 
, ( SUBSTR (weblog_tXt , ( INDEX ( weblog txt , 1 { ' ) ) ,300) ) 
FROM input teat 



[0017] After parsing the Web-log data and extracting the desired information, the DBMS identifies all Web-log entries 
associated with an individual user session (step 315). One technique for doing so involves identifying alt entries that 
list a single user-ID code and then selecting from these the entries with date-and-time stamps that differ by less than 
some prescribed amount. The sample SQL code below uses the MDIFF function of the Teradata DBMS to determine 
when the date-and-timestamps associated with two different Web-log entries lie within 30 minutes of each other. When 
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this occurs, and when those Web-log entries Identify a single user-ID code, the DBMS concludes that the two Web- 
log entries belong to a single clickstream. 



CREATE SET TABLE sessionize - .calcsessian, NO FALLBACK 
( 

user id CHAR (50) CHARACTER SET LATIN NOT CASBSPBCIFIC, 
session_id CHAR (50) CHARACTER SET LATIN NOT CASESPECIFIC , 
session_start INTEGER, 
transaction_timestarap INTEGER, 
theradiff INTEGER, 

weblog_txt CHAR (300) CHARACTER SET LATIN HOT CASESPECIPIC) 
PRIMARY INDEX ( userid ) ; 

INSERT INTO calcsessian 
SELECT user_id, 

TRlM(user_id) 1 1 TRIM (transact ion_times tamp) , 
transact ion times tamp (INTEGER) , 
transact ion_times tamp ( INTEGER) , 

MDIFP ( transact ion_timest amp, 1 . transact ion times tan?)) (INTEGER 

veblog_txt 
FROM presession 
GROUP BY 1 

QUALIFY MDIP'P(tranBaction - tijnestaii5> / l # transact ion_timestamp) > 3000 
OR miFF(transaction_timestan^,l»transaction_tiinestainp) is null; 

INSERT into calcsessian 
SELECT a.user_jid f 

a.session_id, 

a. sesaionstart, 

b . transaction_ time stamp, 

a. themdiff , 

b. weblog_txt 
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FROM calcsession a, 

preseBsion b 
5 WHERE a. user id = b.user id 

AND b . transactioti_tiroestainp GE a.session_start 

AND b.transaction_timestainp It a.sessionjBtart + a.themdif f 

INSERT INTO calcsession 
10 SELECT a.user_id # 

a. Bession_id / 

a . session_start , 

b . transact ion_times tamp , 
a. thejodif f , 

is b - weblogtxt 

FROM calcsession a # 

presession b 
WHERE a.user_id = b.user — id 

AND (b . user_id, b . transaction timestamp , b . weblog_txt ) NOT IN 
20 ( SELECT user_id , 

transact iontimes tamp , 
weblog txt 
FROM calcsession) 
AND a.the_ndif£ IS NOLL 

25 



[0018] The DBMS then sorts the selected Web-log entries by date-and-time stamp value to recreate the clickstream 

(step 320). In some embodiments, the clickstream data itself is stored to disk for later analysis. 

[001 9] The various implementations of the invention are realized in electronic hardware, computer software, or com- 

30 binations of these technologies. Most implementations include one or more computer programs executed by a pro- 
grammable computer. In general, the computer includes one or more processors, one or more data-storage compo- 
nents {e.g., volatile and nonvolatile memory modules and persistent optical and magnetic storage devices, such as 
hard and floppy disk drives, CD-ROM drives, and magnetic tape drives), one or more input devices (e.g, t mice and 
keyboards), and one or more output devices (e.g., display consoles and printers). 

35 [0020] The computer programs include executable code that is usually stored in a persistent storage medium and 
then copied into memory at run-time. The processor executes the code by retrieving program instructions from memory 
in a prescribed order. When executing the program code, the computer receives data from the input and/or storage 
devices, performs operations on the data, and then delivers the resulting data to the output and/or storage devices. 
[0021] The text above describes one or more specific embodiments of a broader invention. The invention also is 

40 carried out in a variety of alternative embodiments and thus is not limited to those described here. For example, while 
the invention has been described here in terms of a DBMS that uses a massively parallel processing (M PP) architecture, 
other types of database systems, including those that use a symmetric multiprocessing (SMP) architecture, are also 
useful in carrying out the invention. Modifications may be incorporated without departing from the scope of the present 
invention. 

45 



Claims 



1 . A method for use in tracking the actions of an Internet user, the method comprising: 

so 

loading data from a transaction log of an Internet server into a database system, where the data includes an 
entry for each request to the Internet server, including information identifying which user submitted the request 
and information identifying the time at which the request was received; and 

selecting from the data all entries associated with a particular user and corresponding to a single session of 
55 that user. 

2. A method as claimed in claim 1 , where the step of selecting includes selecting entries with time stamps lying in a 
predetermined range. 
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3. A method as claimed in claim 1 or claim 2, including sorting the selected entries chronologically to reconstruct the 
user's clickstream. 

4. A computer program, stored on a tangible storage medium, for use in tracking the actions of an Internet user, the 
5 program comprising executable instructions that cause a computer to: 

load data from a transaction log of an Internet server into a database system, where the data includes an entry 
for each request to the Internet server, including information identifying which user submitted the request and 
information identifying the time at which the request was received; and 
10 select from the data all entries associated with a particular user and corresponding to a single session of that 

user. 

5. A program as claimed in claim 4, where, in selecting entries, the computer selects entries with time stamps lying 
in a predetermined range. 

15 

6. A program as claimed in claim 4 or claim 5, where the computer also sorts the selected entries chronologically to 
reconstruct the user's clickstream. 

7. A database system comprising: 

20 

one or more data-storage facilities for use in storing data received from a transaction log of an Internet server 
computer, where the data includes an entry for each request to the Internet server computer, including infor- 
mation identifying which user submitted the request and information identifying the time at which the request 
was received; and 

25 one or more processing modules configured to manage the data stored in the data-storage facilities; and 

a database-management component configured to select from the data all entries associated with a particular 
user and corresponding to a single session of that user. 

8. A system as claimed in claim 7, where the database-management component is configured to select entries with 
30 time stamps lying in a predetermined range. 

9. A system as claimed in claim 7 or claim 8, where the database-management component is configured to compare 
time stamps of entries and to select each entry for which the time stamp differs from the time stamp of another 
entry by less than a predetermined amount. 

35 

10. A system as claimed in any of claims 7 to 9, where the database-management component is configured to sort 
the selected entries chronologically to reconstruct the user's clickstream. 
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